Lecture 5 NGS & File Formats

05/28/2020

Goals

  • Understanding NGS file formats

  • Understanding NGS quality assessment

FASTA

  • FASTA format reports a sequence

  • Can contain protein sequences or nucleic acid sequences

  • Common applications include

    • Reference Genome
    • Gene Sequences

FASTA

  • Starts with a sequence header, and follows with the sequence itself
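The header-then-sequence layout can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: it assumes the usual convention that header lines begin with `>` and that a sequence may wrap across several lines. The function name `read_fasta` and the toy records are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


# Toy input: one wrapped nucleotide record and one protein record.
example = io.StringIO(">seq1 toy example\nACGT\nTTGA\n>seq2\nMKV\n")
records = list(read_fasta(example))
```

Here `records` holds one tuple per entry, with the wrapped lines of `seq1` joined into a single string.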

FASTQ

  • Very widely used
  • Delivered from the sequencer
  • Similar to FASTA
    • Different header format
    • Includes quality scores
  • Entry point for alignment and QC

FASTQ

Nearly everything works with this format. Some common examples are:

  • Aligners
    • Bowtie, TopHat2
  • Assemblers
    • Velvet, SPAdes
  • QC tools
    • Trimmomatic, FastQC

FASTQ

  • Four lines per entry
    • Sequence Header
      • @ to whitespace = sequence identifier
      • whitespace to line end = sequence description
    • Sequence
    • +
    • Quality Scores
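The four-line layout above can be sketched as a small reader. This is an illustrative example under stated assumptions: every record is exactly four lines (no wrapped sequences), and the names `FastqRecord` and `read_fastq` as well as the toy read are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str   # text between "@" and the first whitespace
    description: str  # rest of the header line (may be empty)
    sequence: str
    quality: str      # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an open FASTQ file, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        parts = header[1:].split(None, 1)
        identifier = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        yield FastqRecord(identifier, description, sequence, quality)


# Toy record, invented for this example.
example = io.StringIO("@read1 sample=toy\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

The length check reflects the fact that each base has exactly one quality character.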

FASTQ

  • Quality Scores
    • Scores typically range from 0 to 40 and are encoded as ASCII characters, primarily with an offset of 33
    • Q = \( -10 \log_{10} P \)
    • P = \( 10 ^ {-Q/10} \)

https://en.wikipedia.org/wiki/ASCII

Phred quality score (Q)   Probability of incorrect call (P)   Base call accuracy
10                        1 in 10                             90%
20                        1 in 100                            99%
30                        1 in 1,000                          99.9%
40                        1 in 10,000                         99.99%
50                        1 in 100,000                        99.999%
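As a worked example of the encoding, the sketch below turns a quality string into Phred scores and error probabilities. It assumes the offset of 33 mentioned above; the function names and the toy quality string are made up for illustration.

```python
from typing import List


def decode_phred(quality: str, offset: int = 33) -> List[int]:
    """Convert each ASCII quality character to its Phred score Q."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Toy quality string: 'I' is ASCII 73, '5' is 53, '+' is 43, '!' is 33.
scores = decode_phred("I5+!")
probabilities = [error_probability(q) for q in scores]
```

With offset 33, the four characters decode to Q = 40, 20, 10 and 0, which correspond to error probabilities of 1 in 10,000, 1 in 100, 1 in 10 and 1 in 1.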

FastQC

SAM / BAM / CRAM

Sequence Alignment Map (SAM)

  • Standardizes how alignments are reported
    • Alignment
    • Quality Scores (mapping + base quality)
    • The original reads from the FASTQ
    • Paired end information

BAM - compressed searchable binary SAM

CRAM - even smaller compressed searchable binary SAM

SAM / BAM / CRAM

Sequence Alignment Map (SAM)

Fully described in a specification

Complex header with many optional fields. Some include:

  • command that generated the SAM file
  • SAM format version
  • sequencer name and version

Example of the first few fields of the SAM header

SAM / BAM / CRAM

Where is it used?

  • Alignment algorithms
  • Some assemblers
  • CRAM/unaligned BAM (uBAM) can be used for data delivery at some institutions: this cuts down significantly on storage space and transfer time.
  • Alignment viewers
  • Variant detection algorithms

SAM / BAM / CRAM

11 mandatory fields
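The 11 mandatory fields are tab-separated and, per the SAM specification, are named QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL; optional tags follow them. The sketch below splits one alignment line into those fields. The `SamRecord` type, the `parse_sam_line` function and the toy alignment line are hypothetical, and real work would normally go through samtools or a library.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33)
    tags: List[str]  # any optional fields after the first 11


def parse_sam_line(line: str) -> SamRecord:
    """Split one non-header SAM line into its mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Toy alignment line, invented for this example.
toy = "read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(toy)
```

Header lines (those starting with `@`) are not handled here and would need to be skipped before calling the function.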

SAM / BAM / CRAM

Flags describe each read and allow for file-level summaries and filtering.

The appropriate tool can easily manipulate BAM files,

e.g., samtools, Picard

samtools flagstat file.bam
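The FLAG field is a sum of bits, each answering one yes/no question about the read. The sketch below decodes a flag value using the bit meanings from the SAM specification; the short labels in `FLAG_BITS` and the function name `decode_flag` are chosen here for illustration.

```python
from typing import List

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the label of every bit that is set in a SAM flag."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
labels = decode_flag(99)
is_unmapped = bool(4 & 0x4)  # filtering is a single bitwise test
```

A flag of 99 therefore describes a properly paired first read whose mate aligns to the reverse strand. Summaries such as those printed by `samtools flagstat` amount to counting reads per bit.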

SAM / BAM / CRAM

CIGAR can encode the alignment
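A CIGAR string is a series of length/operation pairs, for example `10M2I5M` for 10 aligned bases, a 2-base insertion and 5 more aligned bases. The sketch below parses such a string and computes how many reference and read bases it covers, using the operation letters defined in the SAM specification. The function names and the example string are made up for illustration.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the reference or along the read.
REFERENCE_CONSUMING = set("MDN=X")
QUERY_CONSUMING = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # "*" means no alignment information
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)


def query_length(cigar: str) -> int:
    """Number of read bases described (hard clips excluded)."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


toy = "3S10M2I5M1D20M"
span = reference_span(toy)   # 10 + 5 + 1 + 20
length = query_length(toy)   # 3 + 10 + 2 + 5 + 20
```

The query length should match the length of the SEQ field, which makes it a handy sanity check.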

SAM / BAM / CRAM

MAPQ encodes the mapping quality

\( -10 \log_{10} \Pr\{\text{mapping position is wrong}\} \)

255 indicates that no mapping quality is available
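Inverting the formula gives the probability that a read is placed at the wrong position. The sketch below does that and treats 255 as missing; the function name is hypothetical.

```python
from typing import Optional


def mapq_to_error_probability(mapq: int) -> Optional[float]:
    """Return Pr{mapping position is wrong}, or None when MAPQ is 255."""
    if mapq == 255:
        return None  # no mapping quality available
    return 10 ** (-mapq / 10)


p_good = mapq_to_error_probability(30)      # about 1 in 1,000
p_missing = mapq_to_error_probability(255)  # None
```

Note that individual aligners differ in how they assign MAPQ values, so thresholds are best chosen per tool.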

BED

BED (Browser Extensible Data)

Simple format to describe intervals on the genome

Basic form is 3 columns

  • Chromosome Name
  • Chromosome Start
  • Chromosome End

start is 0-based

end is 1-based (equivalently, 0-based and exclusive), so the length of an interval is end - start

the first 100 bases on chromosome 1 would be represented with

chr1 0 100

and the next 100 bases

chr1 100 200
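The coordinate convention is easy to get wrong, so here is a small sketch that parses the three required columns, computes interval length, and converts to the 1-based inclusive coordinates used by formats such as GTF. The function names are made up for illustration, and the lines are split on any whitespace to match the examples above.

```python
from typing import Tuple


def parse_bed3(line: str) -> Tuple[str, int, int]:
    """Read the three required BED columns from one line."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def bed_length(start: int, end: int) -> int:
    """Length of a BED interval: 0-based start, 1-based end."""
    return end - start


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert to 1-based, inclusive coordinates."""
    return start + 1, end


first = parse_bed3("chr1 0 100")
second = parse_bed3("chr1 100 200")
first_length = bed_length(first[1], first[2])
second_one_based = bed_to_one_based(second[1], second[2])
```

The two intervals are adjacent without overlapping: in 1-based terms the first covers bases 1 to 100 and the second covers bases 101 to 200.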

Optional Fields:

Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts

BED

What software uses BED files?

  • Alignment viewers can use these data to graphically display certain features.
  • bedtools uses this format to query for nearby features.
  • Some annotation files are in this format.
  • Feature detection packages use this as output, e.g. StatePaintR

BED

Many additional derivatives, many from ENCODE

narrowPeak

broadPeak

gappedPeak

etc

GTF

GTF (Gene Transfer Format)

Mostly used to describe Genes.

First 8 fields are required:

  1. seqname - like chromosome name
  2. source - the program that generated the feature
  3. feature - some standard names include 5UTR, CDS, exon, transcript
  4. start - starts at 1 this time ¯\_(ツ)_/¯
  5. end
  6. score
  7. strand
  8. frame

GTF

9th column

Required:

gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1";

Optional:

gene_type "unprocessed_pseudogene"; gene_name "WASH7P";

transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001";

exon_number 11; exon_id "ENSE00001843071.1";

level 2; transcript_support_level "NA";

ont "PGO:0000005"; tag "basic";

havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
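A GTF line can be split into its eight fixed fields plus the attribute column, which is a list of `key value;` pairs. The sketch below is one way to do that, assuming tab-separated columns and straight double quotes around text values. The function names are hypothetical, and the coordinates, source and strand in the toy line are invented; only the attribute values come from the slide above.

```python
import re
from typing import Dict, List

# key, whitespace, optionally quoted value, semicolon
ATTRIBUTE = re.compile(r'(\S+)\s+"?([^";]*)"?\s*;')


def parse_attributes(column: str) -> Dict[str, List[str]]:
    """Parse the 9th GTF column; values are lists because keys may repeat."""
    attributes: Dict[str, List[str]] = {}
    for key, value in ATTRIBUTE.findall(column):
        attributes.setdefault(key, []).append(value)
    return attributes


def parse_gtf_line(line: str) -> dict:
    """Split one GTF feature line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF feature line has exactly 9 columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": parse_attributes(attrs),
    }


# Toy line: positions are made up for this example.
toy = ('chr1\tHAVANA\texon\t100\t200\t.\t-\t.\t'
       'gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
       'exon_number 11; tag "basic";')
record = parse_gtf_line(toy)
```

Storing each attribute as a list is a design choice made here because some keys, such as `tag`, can appear more than once on a line.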

GTF

What uses GTF?

Any tool that requires information about gene position for analysis such as:

VCF

VCF (Variant Call Format)

Describes SNVs and INDELs

  • Single Nucleotide Variants, like SNPs and point mutations
  • Insertions, deletions, other sequence variations

VCF

Another complex format, but one with an official specification

8 required fields:

  1. Chromosome Name
  2. Position
  3. ID
  4. Reference base(s)
  5. Alternate base(s)
  6. Variant Quality
  7. Filter
  8. Info
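The eight required fields above can be pulled out of a tab-separated data line as in the sketch below. It is illustrative only: the function names and the toy variant line are invented, header lines starting with `#` are not handled, and the INFO column is assumed to be a semicolon-separated list of `key=value` pairs or bare flags.

```python
from typing import Dict, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Parse the INFO column; keys without a value become True."""
    result: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return result  # "." means no information
    for entry in info.split(";"):
        key, separator, value = entry.partition("=")
        result[key] = value if separator else True
    return result


def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line into its 8 required fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 fields")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),              # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),        # several alternate alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
    }


# Toy variant, invented for this example.
toy = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=100;DB"
variant = parse_vcf_line(toy)
```

Any columns after the eighth (FORMAT and per-sample genotypes) are ignored in this sketch.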

VCF

What uses VCF?